Domain-independent term extraction through domain modelling

Authors

  • Georgeta Bordea
  • Paul Buitelaar
  • Tamara Polajnar
Abstract

Extracting general or intermediate level terms is a relevant problem that has not received much attention in the literature. Current approaches for term extraction rely on contrastive corpora to identify domain-specific terms, which makes them better suited for specialised terms that are rarely used outside of the domain. In this work, we propose an alternative measure of domain specificity based on term coherence with an automatically constructed domain model. Although previous systems make use of domain-independent features, their performance varies across domains, while our approach displays a more stable behaviour, with results comparable to, or better than, state-of-the-art methods.

Term extraction plays an important role in a wide range of applications including information retrieval (Yang et al., 2005), keyphrase extraction (Lopez and Romary, 2010), information extraction (Yangarber et al., 2000), domain ontology construction (Kietz et al., 2000), text classification (Basili et al., 2002), and knowledge mining (Mima et al., 2006). In many of these applications the specificity level of a term is a relevant characteristic, but despite the large body of work in term extraction there are few methods that are able to identify general terms or intermediate level terms. Take for example the following structure from the AGROVOC vocabulary (http://aims.fao.org/standards/agrovoc/about): resources → natural resources → mineral resources → lignite, where resources is an upper level term, natural resources and mineral resources are intermediate level terms, and lignite is a leaf. Intermediate level terms are specific to a domain but are broad enough to be usable for summarisation and classification. Methods that make use of contrastive corpora to select domain-specific terms favour the leaves of the hierarchy, and are less sensitive to generic terms that can be used in other domains.
Instead, we construct a domain model by identifying upper level terms from a domain corpus. This domain model is further used to measure the coherence of a candidate term within a domain. The underlying assumption is that top level terms (e.g., resource) can be used to extract intermediate level terms, in our example natural resources and mineral resources. Our method for constructing a domain model is evaluated directly through an expert survey as well as indirectly based on its contribution to intermediate level term extraction. Although domain modelling is tested and exemplified with English, the ideas presented here are not language dependent and can be applied to other languages; doing so is outside the scope of this work. We start by giving an overview of related work in term extraction in Section 1. Then, an approach to construct a domain model based on domain coherence is proposed in Section 2, followed by a method to apply domain models for term extraction. The experimental part of the paper starts with a direct evaluation of a domain model through a user survey (Section 3). A first set of experiments is carried out in a standard setting for term evaluation, while the second set of experiments is application-driven, using corpora annotated for keyphrase extraction, information extraction, and information retrieval. We conclude this paper in Section 4, giving a few directions for future work.
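
To make the idea of measuring a candidate term's coherence with a domain model more concrete, the following is a minimal sketch and not the authors' implementation. The text above does not give the scoring formula, so this example assumes pointwise mutual information (PMI) over shared contexts as a stand-in coherence measure. The function names (`contains`, `pmi`, `coherence`), the toy contexts, and the two-word domain model are all hypothetical.

```python
import math

def contains(tokens, term):
    """True if the token list `term` occurs contiguously in `tokens`."""
    n = len(term)
    return any(tokens[i:i + n] == term for i in range(len(tokens) - n + 1))

def pmi(contexts, term, word):
    """PMI of `term` and `word` over contexts (assumed measure, not from the paper)."""
    total = len(contexts)
    with_term = sum(contains(c, term) for c in contexts)
    with_word = sum(contains(c, word) for c in contexts)
    both = sum(contains(c, term) and contains(c, word) for c in contexts)
    if both == 0:
        return 0.0  # no shared context: treated as no evidence of coherence
    return math.log((both * total) / (with_term * with_word))

def coherence(contexts, term, domain_model):
    """Average PMI between a candidate term and the domain model words.

    Model words nested inside the candidate itself are skipped, since they
    would co-occur with it trivially.
    """
    scores = [pmi(contexts, term, w) for w in domain_model if not contains(term, w)]
    return sum(scores) / len(scores) if scores else 0.0

# Toy data (hypothetical): each context is one tokenised sentence.
contexts = [
    "mineral resources support mining and energy production".split(),
    "mining of mineral resources supplies energy".split(),
    "the meeting was held on monday".split(),
    "energy policy affects mining".split(),
]
domain_model = [["mining"], ["energy"]]  # assumed upper level terms

in_domain = coherence(contexts, ["mineral", "resources"], domain_model)
off_domain = coherence(contexts, ["meeting"], domain_model)
print(in_domain, off_domain)
```

In this toy setting the candidate mineral resources shares contexts with the domain model words and receives a positive score, while meeting never co-occurs with them and scores zero. A realistic setup would need a larger corpus, a principled choice of context window, and the domain model construction step described in the paper.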

Similar resources

Modelling and application: A research domain in mathematics education

Abstract: The main purpose of this paper is to introduce modelling and application as a research domain in mathematics education by reviewing related literature. The first purpose of this review is to give a clearer meaning of modelling and application and, based on that, to make the distinction between modelling in mathematics education and modelling in other scientific domains. There are so...

Chinese Term Extraction Using Minimal Resources

... identify features of the relatively stable and domain independent term delimiters rather than that of the terms. For term verification, a link analysis based method is proposed to calculate the relevance between term candidates and the sentences in the domain specific corpus from which the candidates are extracted. The proposed approach requires no prior domain knowledge, no general corpora, no full segment...

Automatically Selecting Domain Markers for Terminology Extraction

Some approaches to automatic terminology extraction from corpora imply the use of existing semantic resources for guiding the detection of terms. Most of these systems exploit specialised resources, like UMLS in the medical domain, while a few try to take profit from general-purpose semantic resources, like EuroWordNet (EWN). As the term extraction task is clearly domain depending, in the case ...

Chinese Term Extraction Based on Delimiters

Existing techniques extract term candidates by looking for internal and contextual information associated with domain specific terms. The algorithms always face the dilemma that fewer features are not enough to distinguish terms from non-terms whereas more features lead to more conflicts among selected features. This paper presents a novel approach for term extraction based on delimiters which ...

DiLiA - a Digital Library Assistant - A New Approach to Information Discovery through Information Extraction and Visualization

This paper presents preliminary results of our current research project DiLiA (Digital Library Assistant). The goals of the project are twofold. One goal of the project is the development of domain-independent information extraction methods. The other goal is the development of information visualization methods that interactively support researchers at time consuming information discovery t...

Publication date: 2013